(57) Abstract: A method and apparatus for decoding an input MPEG video stream (Fig. 1) are provided that include a core
processor with a very long instruction word (VLIW) processor (Fig. 2A, 21) and a co-processor that includes a variable length decoder
(VLD) for decoding the MPEG video stream (Fig. 2A, 24). The input MPEG video stream is organized into macroblocks, wherein
each macroblock includes a header for a macroblock that is not decoded, and encoded data for a macroblock whose header was
previously decoded by the VLD (Fig. 5). Thereafter, the VLD decodes the encoded video data of a first macroblock whose header has been
decoded, and decodes the header of a second (current) macroblock (Fig. 6). The VLIW then performs motion compensation on a current
macroblock based upon reference data of a previously decoded macroblock (Fig. 7). The VLIW also adds a fake slice start code and fake
macroblock data at the end of each picture in the input MPEG video data stream (Fig. 3, S305), and utilizes the fake slice start code
and fake macroblock data to skip to a next slice (Fig. 3, S306). The fake macroblock data indicates an error to the VLD, stopping the
decoding process until the core processor clears the interrupt and reinitiates decoding of a selected macroblock (Fig. 3, 310).
METHOD AND APPARATUS FOR DECODING MPEG VIDEO SIGNALS

RELATED APPLICATIONS: 

The present Application is related to the U.S. patent application entitled
"METHOD AND APPARATUS FOR DECODING MPEG VIDEO SIGNALS WITH
CONTINUOUS DATA TRANSFER", Serial No. 09/481,603, filed January 12, 2000,
and assigned to the Assignee of the present invention. The disclosure of the patent
application "METHOD AND APPARATUS FOR DECODING MPEG VIDEO
SIGNALS WITH CONTINUOUS DATA TRANSFER" is hereby incorporated by
reference herein in its entirety.

The present Application is also related to the U.S. patent application entitled 
"METHOD AND APPARATUS FOR DECODING MPEG VIDEO SIGNALS 
USING MULTIPLE DATA TRANSFER UNITS", Serial No. 09/481,336, filed 
January 12, 2000, and assigned to the Assignee of the present invention. The
disclosure of the patent application "METHOD AND APPARATUS FOR 
DECODING MPEG VIDEO SIGNALS USING MULTIPLE DATA TRANSFER 
UNITS" is hereby incorporated by reference herein in its entirety. 

FIELD OF THE INVENTION

The present invention relates to video decoders, and more particularly, to a 
method and apparatus for decoding an encoded MPEG video data stream into raw video
data. 

BACKGROUND OF THE INVENTION 
MPEG Background

The Moving Picture Experts Group ("MPEG") is a committee under the
International Organization for Standardization ("ISO") and the International
Electrotechnical Commission ("IEC") that develops industry standards for
compressing/decompressing video and audio data. Two such standards that have been
ratified by MPEG are called MPEG-1 and MPEG-2. MPEG-1 is documented in the
ISO/IEC 11172 publication and is fully incorporated herein by reference. MPEG-2 is
disclosed in ISO/IEC publications 11172 and 13818, and is also incorporated herein by
reference.

MPEG-1 was developed with the intent to play back compressed video and
audio data either from a CD-ROM, or transfer compressed data at a combined coded
bit rate of approximately 1.5 Mbits/sec. MPEG-1 approximates the perceptual quality
of a consumer videotape (VHS). However, MPEG-1 was not intended for broadcast
quality. Hence, MPEG-1 syntax was enhanced to provide efficient representation of
interlaced broadcast video signals. This became MPEG-2.

MPEG-1 and MPEG-2 can be applied at a wide range of bit rates and sample
rates. Typically MPEG-1 processes data at a Source Input Format (SIF) resolution of 352
pixels x 240 pixels at 30 frames per second, at a bit rate less than 1.5 Mbits/sec.
MPEG-2, developed to serve the requirements of the broadcast industry, typically processes
352 pixels x 240 lines at 30 frames/sec ("Low Level"), and 720 pixels/line x 480 lines
at 30 frames/sec ("Main Level"), at a rate of approximately 5 Mbits/sec.

MPEG standards efficiently represent video image sequences as compactly 
coded data. MPEG standards describe decoding (reconstruction) processes by which 
encoded bits of a transmitted bit stream are mapped from compressed data to the 
original raw video signal data suitable for video display. 

MPEG ENCODING 

MPEG encodes video sequences such that RGB color images are converted to
YUV space with two chrominance channels, U and V. An MPEG bitstream is
compressed by using three types of frames: I or intra frames, P or predicted frames,
and B or bi-directional frames. I frames are typically the largest frames, containing
enough information to qualify as entry points. Predicted frames are based on a
previous frame and are highly compressed. Bi-directional frames refer both to future
and previous frames, and are the most highly compressed.
MPEG pictures can be simply intra-coded, with no motion
compensation prediction involved, forward coded with pel prediction projected
forward in time, backward coded with pel prediction backward in time, or
bi-directionally coded, with reference to both forward and backward pictures. Pictures
can be designated as I (formed with no prediction involved as a still image from the
image data originating at the source, e.g., a video camera), P (formed with prediction
from forward pictures) or B (formed with prediction from a forward picture
and/or a backward picture). An example of a display sequence for MPEG frames might
be shown as follows:
IBBPBBPBBPBBIBBPBBPB

Each MPEG picture is broken down into a series of slices and each slice is 
comprised of a series of adjacent macroblocks. 

MPEG pictures can be progressive or interlaced. An interlaced group of
pictures ("GOP") comprises field and/or frame pictures. For frame pictures, the macroblock
prediction scheme is based upon fields (partial frames) or complete frames.

The MPEG encoder decides how many pictures will occur in a GOP, and how
many B pictures will be interleaved between each pair of I and P pictures or pair of P 
pictures in the sequence. Because of picture dependencies, i.e., temporal compression, 
the order in which the frames are transmitted, stored or retrieved, is not necessarily the 
video display order, but rather an order required by the decoder to properly decode 
pictures in the bitstream. 
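
The reordering described above can be sketched as follows. This is an illustrative sketch only, assuming that every B picture depends on the nearest preceding and following anchor (I or P) picture; the function name `display_to_coded_order` is hypothetical and is not part of the MPEG standards or of the disclosed decoder.

```python
def display_to_coded_order(display):
    """Reorder picture types from display order to a decodable coded order.

    Assumption: each B picture needs the anchor (I or P) that follows it in
    display order, so that anchor is emitted before the B pictures.
    """
    coded = []
    pending_b = []  # B pictures waiting for their following anchor
    for index, ptype in enumerate(display):
        if ptype == "B":
            pending_b.append((index, ptype))
        else:
            coded.append((index, ptype))  # the anchor goes out first
            coded.extend(pending_b)       # then the B pictures that preceded it
            pending_b = []
    coded.extend(pending_b)  # trailing B pictures whose anchor lies outside this excerpt
    return coded


order = display_to_coded_order("IBBPBBP")
coded_types = "".join(ptype for _, ptype in order)   # "IPBBPBB"
display_indices = [index for index, _ in order]      # [0, 3, 1, 2, 6, 4, 5]
```

The decoder performs the inverse mapping before display, using the display indices carried with each picture.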

MPEG compression employs two fundamental techniques: motion
compensation and spatial redundancy reduction. Motion compensation determines how
predicted or bi-directional frames relate to their reference frame. A frame is divided
into 16 x 16 pixel units called macroblocks. The macroblocks in one frame are
compared to macroblocks of another frame; similarities between the frames are not
coded. If similar macroblocks shift position between frames, the movement is
explained by motion vectors, which are stored in a compressed MPEG stream.
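
As a rough illustration of the macroblock and motion vector concepts above, the sketch below counts the 16 x 16 macroblocks in a frame and fetches a displaced reference block. It assumes full-pel motion vectors given as (dy, dx) and a reference frame stored as a list of rows; half-pel interpolation and field prediction are omitted, and both function names are hypothetical.

```python
MB_SIZE = 16  # macroblock width and height in pixels


def macroblock_count(width, height):
    """Number of 16 x 16 macroblocks needed to cover a width x height frame."""
    across = -(-width // MB_SIZE)   # ceiling division
    down = -(-height // MB_SIZE)
    return across * down


def predict_macroblock(reference, mb_row, mb_col, mv):
    """Return the 16 x 16 reference block displaced by motion vector mv = (dy, dx).

    Assumption: full-pel vectors that keep the block inside the reference frame.
    """
    top = mb_row * MB_SIZE + mv[0]
    left = mb_col * MB_SIZE + mv[1]
    return [row[left:left + MB_SIZE] for row in reference[top:top + MB_SIZE]]


low_level_mbs = macroblock_count(352, 240)    # 22 x 15 macroblocks
main_level_mbs = macroblock_count(720, 480)   # 45 x 30 macroblocks
```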
The spatial redundancy technique reduces data by describing differences within
corresponding macroblocks. Spatial compression is achieved by considering the
frequency characteristics of a picture frame. The process uses discrete cosine
transform ("DCT") coefficients that spatially track changes in color and brightness.
The DCTs are done on 8 x 8 pixel blocks. The transformed blocks are converted to the
"DCT domain", where each entry in the transformed block is quantized with respect to
a set of quantization tables. Huffman coding and zig-zag ordering are used to transmit
the quantized values.
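
The zig-zag ordering mentioned above can be sketched as follows. This is a minimal illustration of the conventional zig-zag scan of an 8 x 8 block; the quantization step is deliberately simplified to a truncating division, and the function names are hypothetical.

```python
def zigzag_order(n=8):
    """Return (row, col) pairs of an n x n block in zig-zag scan order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col identifies one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        diagonal = [(r, s - r) for r in rows]
        # Even diagonals are walked bottom-left to top-right, odd ones the other way.
        order.extend(reversed(diagonal) if s % 2 == 0 else diagonal)
    return order


def quantize_and_scan(block, qtable):
    """Quantize each DCT coefficient (simplified: truncate) and emit in zig-zag order."""
    return [int(block[r][c] / qtable[r][c]) for r, c in zigzag_order(len(block))]


scan = zigzag_order()
```

Because high-frequency coefficients tend to quantize to zero, the scan groups those zeros at the end of the sequence, which suits run-length and Huffman coding.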

MPEG DECODING 

MPEG video decoders are known in the art. The video decoding process is
generally the inverse of the video encoding process and is employed to reconstruct a
motion picture sequence from a compressed and encoded bitstream. Generally MPEG
video bitstream data is decoded according to syntax defined by MPEG standards. The
decoder must first identify the beginning of a coded picture, identify the type of
picture, and then decode each individual macroblock within a particular picture.

Generally, encoded video data is received in a rate buffer or a video buffer verifier
("VBV"). The data is retrieved from the channel buffer by an MPEG decoder or
reconstruction device for performing the decoding. The MPEG decoder performs inverse
scanning to remove any zig-zag ordering and inverse quantization to de-quantize the
data. Where frame or field DCTs are involved, the MPEG decoding process utilizes frame
and field Inverse Discrete Cosine Transforms ("IDCTs") to decode the respective
frame and field DCTs, and converts the encoded video signal from the frequency
domain to the spatial domain to produce reconstructed raw video signal data.
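
For illustration, the frequency-to-spatial conversion can be sketched with a direct (unoptimized) 8 x 8 inverse DCT. This follows the standard textbook IDCT definition rather than any particular decoder implementation; inverse quantization and the clipping of results to the pixel range are omitted, and the function name is hypothetical.

```python
import math


def idct_8x8(coeffs):
    """Direct 8 x 8 inverse DCT: coeffs[u][v] in, pixels[x][y] out."""
    def scale(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    pixels = [[0.0] * 8 for _ in range(8)]
    for x in range(8):
        for y in range(8):
            total = 0.0
            for u in range(8):
                for v in range(8):
                    total += (scale(u) * scale(v) * coeffs[u][v]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            pixels[x][y] = total / 4
    return pixels


# A block holding only a DC coefficient reconstructs to a flat block.
dc_only = [[0.0] * 8 for _ in range(8)]
dc_only[0][0] = 80.0
flat = idct_8x8(dc_only)  # every entry is 80 / 8 = 10
```

Practical decoders use fast separable or SIMD formulations; the quadruple loop here is only meant to make the arithmetic visible.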

The MPEG decoder also performs motion compensation using transmitted motion
vectors to reconstruct temporally compressed pictures. When reference pictures such
as I or P pictures are decoded, they are stored in a memory buffer. When a
reconstructed picture becomes a reference or anchor picture, it replaces the oldest
reference picture. When a temporally compressed picture, also referred to as a target
frame, is received, such as a P or B picture, motion compensation is performed on the
picture using neighboring decoded I or P reference pictures. The MPEG decoder examines
motion vector data, determines the respective reference block in the reference picture,
and accesses the reference block from the frame buffer.

After the decoder has Huffman decoded all the macroblocks, the resultant
coefficient data is then inverse quantized and operated on by an IDCT process to
transform macroblock data from the frequency domain to data in the space domain. Frames
may need to be re-ordered before they are displayed in accordance with their display
order instead of their coding order. After the frames are re-ordered, they may then be
displayed on an appropriate device.

Fig. 1 shows a block diagram of a typical MPEG decoding system, as is known
in the art. Shown in Figure 1 are a MPEG Demux 10, a MPEG video decoder 11 and
an audio decoder 12. MPEG Demux 10 receives encoded MPEG bit stream data 13
that consists of video and audio data, and splits MPEG bit stream data 13 into MPEG
video stream data 14 and MPEG audio stream data 16. MPEG video stream data 14 is
input into MPEG video decoder 11, and MPEG audio stream data 16 is input into an
MPEG audio decoder 12. MPEG Demux 10 also extracts certain timing information
15, which is provided to video decoder 11 and audio decoder 12. Timing information
15 enables video decoder 11 and audio decoder 12 to synchronize an output video
signal 17 (raw video signal data) from video decoder 11 with an output audio signal
18 (raw audio data) from audio decoder 12.

MPEG video decoders may have a core processor for reconstructing decoded
MPEG video data into raw video signal data, and a co-processor ("VLD") for doing
variable length decoding of the MPEG video data stream. A direct memory access
controller ("DMA"), either associated with or incorporated into a host computer, or
associated with or incorporated into the MPEG video decoder, manages data transfer
between the core processor, VLD and various memory buffers.

Current decoding processors such as those manufactured by Equator
Technology Inc. ("ETI") process data on an individual block by block basis, rather
than at a macroblock level. For component block by block decoding and transfer, the
speed of the processing of an entire macroblock may be limited by data transfer speed.
For example, if a data transfer mechanism is able to transfer 2 bytes per cycle, a
macroblock with six (6) 8 x 8 blocks comprising 768 bytes of data will require 384
cycles and an additional "n" number of cycles for overhead delay per transfer set.
Hence, block by block decoding slows the overall decoding process.
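
The arithmetic in the preceding example can be checked with a short sketch. The figure of 2 bytes per coefficient is an inference from 768 = 6 x 64 x 2, and the overhead of 10 cycles per transfer set is a purely hypothetical stand-in for "n"; the function name is likewise illustrative.

```python
def transfer_cycles(total_bytes, bytes_per_cycle, transfer_sets, overhead_cycles):
    """Cycles to move total_bytes, plus a fixed overhead for every transfer set."""
    data_cycles = -(-total_bytes // bytes_per_cycle)  # ceiling division
    return data_cycles + transfer_sets * overhead_cycles


BLOCKS_PER_MACROBLOCK = 6
COEFFS_PER_BLOCK = 8 * 8
BYTES_PER_COEFF = 2  # assumption inferred from the 768-byte figure
macroblock_bytes = BLOCKS_PER_MACROBLOCK * COEFFS_PER_BLOCK * BYTES_PER_COEFF

N = 10  # hypothetical overhead cycles per transfer set
block_by_block = transfer_cycles(macroblock_bytes, 2, BLOCKS_PER_MACROBLOCK, N)
whole_macroblock = transfer_cycles(macroblock_bytes, 2, 1, N)
```

Under these assumptions the data movement costs 384 cycles either way, but the block by block scheme pays the overhead six times (384 + 6n) while a single macroblock transfer pays it once (384 + n).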

Currently more DMA instructions are required to process each block of data
vis-à-vis processing an entire macroblock of data. Also, conventional MPEG
techniques have multiple waits for different DMA transfers and hence a significant
amount of lead-time occurs that slows the overall decoding process.

Also, current decoding techniques adversely impact parallelism between VLD 
and the core processor and have inefficient VLIW pipelines. Furthermore, currently, 
VLD can only detect errors and is not able to correct those errors. 

Therefore, a decoding system is needed that can efficiently transfer data
between VLD and core processor, and also optimally utilize the resources of both
processors, and perform error recovery in the core processor.



SUMMARY OF THE INVENTION 

The present invention addresses the foregoing drawbacks by providing an
apparatus and method that synchronizes data exchange between a core processor that
includes a very long instruction word (VLIW) processor, and a variable length decoder
(VLD) of an MPEG video decoder, and enhances core processor and co-processor
parallelism. According to one aspect, the present invention provides an incoming
compressed and encoded MPEG video bit stream to a video decoder on a picture by
picture basis. The input MPEG video stream data is organized into pictures and slices.
Thereafter, VLIW adds a fake slice start code and fake macroblock data at the end of
each MPEG input picture, and VLD utilizes the fake slice start code and fake
macroblock data to skip to a next picture. The fake macroblock data indicates an error
to VLD, stopping the decoding process until the core processor reinitiates decoding of
a selected slice.

VLIW then provides the input MPEG coded data stream to VLD on a picture
by picture basis. VLD decodes the header of a current macroblock and the video data
of a previous macroblock whose header has been decoded. The encoded MPEG video
data includes DCT coefficients.

Thereafter, VLD transfers the current decoded header along with the decoded
DCT coefficients of a previously decoded macroblock to the core processor on a
macroblock by macroblock basis. VLIW performs motion vector reconstruction based
upon decoded header data, inverse discrete cosine transforms based upon the decoded
DCT coefficients, and motion compensation based upon reference data of a previous
macroblock(s), and converts the data into raw video data.

The present invention has numerous advantages over the existing art. The
decoding of an entire macroblock of video data assists in maintaining continuous and
efficient pipelined operation. Since a macroblock includes a macroblock header for a
current macroblock and DCT coefficients for a previous macroblock, VLIW can easily
locate data for motion vector reconstruction and compensation.

The foregoing aspects of the invention also simplify the decoding and
reconstruction process because VLD decodes a macroblock header for a current
macroblock, e.g. MB(i), and stores the decoded header data with a macroblock already
decoded, e.g. MB(i-1), and transfers the decoded header and macroblock data (DCTs)
to a data cache for access by VLIW. This enables VLIW to acquire
macroblock data prior to performing motion compensation and IDCTs. This reduces idle
time and improves decoding efficiency. The VLIW architecture also allows simultaneous
data processing and data transfer, and hence improves parallelism. Furthermore, since
VLIW controls VLD operations, error handling is streamlined, which improves
performance.

This brief summary has been provided so that the nature of the invention may
be understood quickly. A more complete understanding of the invention can be 
obtained by reference to the following detailed description of the preferred 
embodiments thereof in connection with the attached drawings. 

BRIEF DESCRIPTION OF THE DRAWING 

Fig. 1 shows a block diagram of a typical MPEG decoding system known in
the art.

Fig. 2A shows a block diagram of a MPEG video decoder according to one 
aspect of the present invention. 

Figure 2B shows a block diagram of data cache 22 memory buffers. 

Fig. 3 shows a flow diagram of process steps for decoding MPEG video stream 
by using a fake slice start code and fake macro-block data. 

Fig. 4 is an example of macroblock data format with fake start code and fake 
macro block data. 

Fig. 5 shows an example of a macroblock data structure. 

Fig. 6 shows a flow diagram of process steps according to one aspect of the 
present invention for decoding an MPEG video stream on a macroblock by 
macroblock basis. 

Figure 7 shows process steps for performing motion compensation and motion 
vector reconstruction of a decoded output video stream. 

Figs. 8A-8L show a flow chart according to another aspect of the present
invention illustrating the general processing, and groups of processes performed by 
various components of a MPEG video decoder. 

The use of similar reference numerals in different Figures indicates similar or
identical items.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 
Overall Architecture: 

Fig. 2A shows a schematic view of an MPEG video decoder 11, according to one
aspect of the present invention. MPEG video decoder 11 has a core processor 20,
which includes a very long instruction word ("VLIW") processor 21. VLIW processor
21 utilizes instructions that are grouped together (i.e., very long) at the time of
compilation of a computer program. As is well known in the art of VLIW processors,
very long instructions are fetched and segregated for execution by VLIW processor 21,
and dispatched to independent execution units.

VLIW processor 21 is connected to a data cache memory 22 over a
bi-directional internal bus 23. VLIW 21 can read input MPEG video stream 14 buffered
in VBV 25 contained within a memory device, for example, SDRAM 26, which also
includes a frame buffer 40 whose functionality is discussed in detail below.

MPEG video decoder 11 also includes a co-processor 23a. Co-processor 23a
has a variable length decoder ("VLD") 24 which decodes (Huffman decodes)
incoming encoded MPEG video stream 14 to produce decoded MPEG video data from
which core processor 20 can reconstruct and output raw video data. Co-processor 23a
also has a memory ("CM1") 29 that has at least two buffers B0 and B1 to store at least
two sets of macroblock data. CM1 29 is connected to VLD 24 over a bi-directional
bus 30 and is also connected to a Direct Memory Access ("DMA") transfer unit, DS1
31, over a bus 32. DS1 31 in turn is also connected to data cache memory 22 via a
bi-directional bus 33, and transfers data from CM1 29 memory buffers to data cache 22
memory buffers. Figure 2B, as described below, shows a block diagram of data cache
22 with various memory buffers.

VLD 24 has an input/output ("I/O") section, a GetBits engine ("GB") 28.
VBV 25 supplies incoming MPEG video stream 14 to VLD 24 through DS0 27, where
DS0 27 is another Direct Memory Access ("DMA") unit channel used for transferring
data between VBV 25 and GetBits engine 28 via buses 34 and 47. GetBits engine 28
gets MPEG coded video bit stream 14 and transfers the data to VLD 24 through an
input buffer (not shown).

VLIW processor 21 communicates command signals to DS0 27 over a
command signal line 35. VLIW 21 can also read/write to CM1 29 over bus 36 and
when VLIW 21 writes to CM1 29, VLD 24 can interpret the writes as a command.
One such command is the "GO" command that allows VLD 24 to start decoding a
macroblock. Also, VLD 24 can send data transfer commands to DS1 31 over
command signal line 37.

It is noteworthy that core processor 20 and co-processor 23a, including all the data
transfer elements, can be integrated on a single chip. An example of such a chip is the
MAP 1000A sold by Equator Technology.

Fig. 2A also shows various DMA elements utilized for storage and transfer of
video data. Fig. 2A shows frame buffer 40, which receives output reconstructed raw
video signal data from data cache memory 22 on a macroblock by macroblock basis
via DMA transfer unit DS3 39, over buses 42 and 45. DS3 39 has three paths,
designated for illustration purposes as DS3_0, DS3_1 and DS3_2, that allow
simultaneous data transfer from data cache 22 to frame buffer 40. It is noteworthy that
the invention is not limited to a three path DMA transfer unit. Frame buffer 40 also
provides macroblock reference data for motion compensation to VLIW processor 21
through DMA transfer unit DS2 38, over buses 41 and 46.

Figure 2B shows a block diagram of various memory buffers that can be
included in data cache 22. Figure 2B shows memory buffers MB_B0', MB_B1' and
MB_B2' to receive data from CM1 29 via DS1 31. Also shown are buffers MCB0'
and MCB1' to receive and store reference data for motion compensation from frame
buffer 40 via DS2 38. Data cache 22 includes output memory buffers designated as
OUT_B0', OUT_B1' and OUT_B2' for storing decoded raw video data. It is
noteworthy that all three buffers can transfer data simultaneously via DMA DS3 39.

It is noteworthy that in one embodiment command lines/buses 34, 35, 37, 41,
42, 43, and 44 can be integrated into a single bus. Also buses 32 and 33 can be
included in a single bus, and furthermore buses 45, 45A, 46 and 47 can be included in
a single bus. In another embodiment all the command lines/buses, namely,
34, 35, 37, 41, 42, 43, 44, 45, 45A, 46 and 47, may be included on a single bus. Figure 2A
and Figure 2B show the logic layout of the various buses and command lines, as
discussed above.

Video Stream decoding using fake slice code 

Figure 3 is a flow diagram showing process steps according to one aspect of the
present invention for decoding MPEG video stream 14 by using a fake slice start code
and fake macro-block data.

In step S301, store input MPEG video stream 14 in VBV 25 in a non-coherent
mode, i.e., no other copy of the data stream is made.

In step S302, VLIW 21 parses video bitstream data 14 stored in VBV 25 to
search for the presence of the start code of a picture. VLIW 21 also determines picture
size ("picture_size") and stores the picture size in cache memory 22.

In step S303, VLIW 21 reads input MPEG video stream 14.

In step S304, VLIW 21 parses input MPEG video stream 14 and finds the end
location of the slice. VLIW 21 follows MPEG standards to identify markers in the
input MPEG video stream 14, as start and end positions of pictures and slices.
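
The marker search in steps S302 and S304 can be illustrated with a byte-level scan for MPEG start codes. This sketch relies only on the start code values defined by the MPEG video standards (a 0x000001 prefix followed by 0x00 for a picture and 0x01 through 0xAF for a slice); the function names and the sample byte string are hypothetical and do not represent the actual parsing code run on VLIW 21.

```python
START_CODE_PREFIX = b"\x00\x00\x01"
PICTURE_START_CODE = 0x00
SLICE_START_MIN, SLICE_START_MAX = 0x01, 0xAF


def find_start_codes(data):
    """Yield (offset, code value) for every start code found in data."""
    pos = data.find(START_CODE_PREFIX)
    while pos != -1 and pos + 3 < len(data):
        yield pos, data[pos + 3]
        pos = data.find(START_CODE_PREFIX, pos + 4)


def classify(code):
    """Name the kinds of start code that matter for picture and slice parsing."""
    if code == PICTURE_START_CODE:
        return "picture"
    if SLICE_START_MIN <= code <= SLICE_START_MAX:
        return "slice"
    return "other"


# Hypothetical stream: one picture start code followed by two slices.
sample = (b"\x00\x00\x01\x00" + b"\xAA\xBB"
          + b"\x00\x00\x01\x01" + b"\x11\x22"
          + b"\x00\x00\x01\x02" + b"\x33")
markers = [(offset, classify(code)) for offset, code in find_start_codes(sample)]
```

The end of a slice is then simply the offset of the next start code, and the end of a picture is the offset of the next non-slice start code.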

In step S305, VLIW 21 adds fake slice start code and fake macroblock data at
the end of a picture. The picture data is appended with fake slice start code and fake
macroblock data to facilitate macroblock level decoding and error handling. Figure 4
shows an example of a macroblock data format 41 with fake start code 42 and fake
macroblock data 43. It is noteworthy that the invention is not limited to the shown
fake start code format; any other format can be used to insert fake slice code. Fake
macroblock data 43 is a macroblock header for pictures that indicates an error in the
marker bit and will cause VLD 24 to stop decoding a current macroblock, and await
further instructions (a "GO" command) from VLIW 21. By appending a fake slice
start code to the end of the picture, VLD 24 skips to the next picture without actually
decoding the data in the present picture.
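
Step S305 can be pictured as a simple append operation on the buffered picture data. The byte values below are hypothetical placeholders: the actual fake start code 42 and fake macroblock data 43 are those of Figure 4, which is not reproduced here, so this sketch only illustrates where the appended data sits.

```python
# Hypothetical values, chosen only for illustration.
FAKE_SLICE_START_CODE = b"\x00\x00\x01\xAF"  # a value in the slice start code range
FAKE_MACROBLOCK_DATA = b"\x00\x00"           # stands in for a header with a bad marker bit


def append_fake_slice(picture):
    """Return the picture data followed by the fake slice start code and macroblock data."""
    return bytes(picture) + FAKE_SLICE_START_CODE + FAKE_MACROBLOCK_DATA


picture = b"\x00\x00\x01\x00" + b"\x12\x34\x56"
padded = append_fake_slice(picture)
```

When the decoder reaches the appended bytes it sees what looks like a new slice whose first macroblock is invalid, which is the condition that makes VLD 24 stop and wait for a "GO" command.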

In step S306, VLIW 21 sets DS0 27 over control line 35 to transfer the
encoded MPEG video stream 14 from VBV 25 to GetBits engine 28, and DS0 27
transfers encoded MPEG video data 14 to GetBits engine 28. VLIW 21 sends a first
slice start code to VLD co-processor 23a for the purpose of slice level synchronization
and also to enable VLD 24 to skip to another slice in a picture. An entire picture is
transferred. This is the most efficient transfer mode, since a picture is the largest data
entity. Transfer of smaller entities, such as a slice, results in a more complex
pre-parsing workload for VLIW 21 and results in a complex data transfer system that can
slow down the overall decoding process.

In step S307, DS0 27 transfers fake slice start code 42 and fake macroblock
data 43 to GetBits engine 28.

In step S308, VLD 24 decodes the macroblock header for macroblock i (MB
(i)). Figure 5 shows an example of a macroblock data structure 500 that consists of a
macroblock header 502 for a MB (i), and DCT coefficients 501 for the previously
decoded macroblock MB (i-1). The Figure 5 macroblock structure improves decoding
efficiency because while VLD 24 decodes a current header, it also decodes the DCTs
of a previous macroblock simultaneously. VLIW 21 can also perform Inverse Discrete
Cosine Transforms and motion compensation on a current macroblock and
simultaneously perform motion vector reconstruction on two previous macroblocks.
This improves parallelism and also minimizes the number of memory buffers.

In step S309, VLD 24 decodes DCTs for MB (i-1). The decoding algorithms
used by VLD 24 are those recited by established MPEG standards and disclosed in
U.S. Patent Application, Serial Number 09/144,693, titled "SYSTEM AND
METHOD FOR DECODING A VARIABLE LENGTH CODE DIGITAL SIGNAL",
filed on March 31, 1998, and assigned to the present assignee. The techniques are
incorporated herein by reference.

In step S310, when commanded by VLIW 21, VLD 24 detects fake slice start
code 42 and fake macroblock data 43 and in step S311, VLD 24 waits for a command
from VLIW 21 to proceed with the next slice or picture.

Variable Length Decoding and transfer of decoded data: 
Figure 6 is a flow diagram showing process steps for macroblock level 
decoding by VLD 24 according to another aspect of the present invention. 

In step S601, VLD 24 receives a macroblock, designated for illustration
purposes as MB (i). VLD 24 receives MB (i) stored in VBV 25 based upon a VLIW 21
command to DS0 27. Macroblock data is transferred from VBV 25 via DS0 27 using
buses 34 and 47. Macroblock data is stored in an input buffer (not shown) in GetBits
Engine 28 and then transferred to VLD 24 for decoding. As shown in Figure 5,
macroblock MB (i) has a header and DCT coefficients for macroblock MB (i-1).

In step S602, VLD 24 decodes DCT coefficients for MB (i-1), and also
decodes the macroblock header for MB (i), designated as HDR (i), using MPEG decoding
techniques, incorporated herein by reference, and stores the decoded DCT coefficients
and the decoded header in CM1 29 memory buffer B0.

In step S603, VLD 24 transfers decoded header HDR (i) and DCT coefficients
of MB (i-1) from CM1 29 memory buffer B0 to data cache 22 memory buffer
MB_B0' (Figure 2A) via DS1 31 and buses 32 and 33 respectively.

In step S604, VLD 24 receives MB (i+1) data, and decodes DCTs for MB (i)
and the MB (i+1) header, using MPEG decoding techniques incorporated herein by
reference, and stores the decoded data in CM1 29 memory buffer B1. The decoding
process in step S604 and the transfer step of S603 are done simultaneously in parallel,
and hence improve overall system performance.
In step S605A, VLD 24 verifies if the transfer from CM1 29 memory buffer
B0 in step S603 is complete. If the transfer is not complete, then in step S605B, VLD
24 waits till the transfer from B0 is complete.

If the step S603 transfer is complete, then in step S606, VLD 24 transfers the
decoded MB (i+1) header and decoded DCT coefficients for MB (i), from CM1 29
memory buffer B1 to data cache 22 memory buffer MB_B1' via DS1 31 using buses
32 and 33, respectively. The foregoing steps (S601 to S606) are repeated till the last
macroblock is reached.
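
The alternation between buffers B0 and B1 in steps S602 to S606 can be summarized as a two-buffer (ping-pong) schedule. The sketch below is a simplified timing model, assuming that each decode and each transfer occupies exactly one time slot; the function name and the notion of a "packet" (one header plus one set of DCT coefficients) are illustrative only.

```python
def pingpong_schedule(num_packets):
    """List, per time slot, the buffer being decoded into and the buffer being transferred.

    Each entry is (decode, transfer), where either element is None or a
    (packet index, buffer name) pair.
    """
    slots = []
    for step in range(num_packets + 1):
        decode = (step, f"B{step % 2}") if step < num_packets else None
        transfer = (step - 1, f"B{(step - 1) % 2}") if step > 0 else None
        slots.append((decode, transfer))
    return slots


schedule = pingpong_schedule(3)
```

In every slot the buffer being filled by the decoder differs from the buffer being drained by DS1 31, which is why two buffers suffice; the check in step S605A covers the case where a transfer takes longer than the assumed single slot.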

In step S607, VLD 24 decodes the header of the last macroblock, designated as MB (l),
and the DCT coefficients for the last but one macroblock MB (l-1), and stores the
decoded data in CM1 29 memory buffer.

In step S608, VLD 24 transfers the decoded MB (l) header and decoded DCT
coefficients for MB (l-1) from CM1 29 memory buffer to data cache 22 via DS1 31
using buses 32 and 33, respectively.

In step S609, VLD 24 decodes DCTs for MB (l) and stores the DCTs with a
dummy header in CM1 29.

In step S610, VLD 24 transfers decoded DCTs for MB (l) and the dummy
header from CM1 29 to data cache 22 via DS1 31 using buses 32 and 33 respectively.

In step S611, VLD 24 waits for the next slice in the input MPEG video stream
14 from GetBits engine 28. VLIW 21 indicates to VLD 24 which slice code
corresponds to the next slice that is to be decoded, thereby enabling skipping slices or
even moving to the next picture.
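
The packing of headers and DCT coefficients across steps S601 to S610 can be sketched as a list of transfer units. This is an illustration under assumptions: the text does not state what accompanies the first header, so the sketch pairs it with no DCT data, and the string labels and function name are hypothetical.

```python
def pack_macroblocks(num_macroblocks):
    """Pair each decoded header HDR(i) with the DCT data of MB(i-1).

    The final unit carries the DCTs of the last macroblock with a dummy header.
    Assumption: the first unit carries a header only.
    """
    units = []
    for i in range(num_macroblocks):
        dct = f"DCT({i - 1})" if i > 0 else None
        units.append((f"HDR({i})", dct))
    units.append(("DUMMY_HDR", f"DCT({num_macroblocks - 1})"))
    return units


units = pack_macroblocks(3)
```

For N macroblocks this yields N + 1 transfer units, so every header and every set of coefficients is delivered exactly once.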

Figure 6 process steps optimize MPEG decoding and data transfer because the
decoded header of a current macroblock (MB (i)) and DCT coefficients of a previous
macroblock (MB (i-1)) are packed together in the same memory buffer. Also, the
decoding of a current macroblock is performed in parallel with data transfer from CM1
29 memory buffer to data cache 22. Furthermore, VLD 24 stops decoding when VLD
24 encounters an error due to fake slice code (Figure 3) and waits for VLIW 21
commands, hence error handling is efficiently controlled by a central processor.

Motion Compensation and Motion Vector reconstruction: 
Figure 7 shows process steps according to another aspect of the present
invention for performing motion compensation and motion vector reconstruction, for
outputting decoded MPEG video stream 17 as raw video data.

In step S701, VLIW 21 commands DS2 38 via command line 43 to get 
reference data for a macroblock, e.g., MB (i) from frame buffer 40. 

In step S702, DS2 loads reference data from frame buffer 40 to data cache 22,
via buses 46 and 41 respectively, and in parallel in step S703A, VLIW 21 reconstructs
the motion vector for MB (i-2). Motion vector data is stored in data cache 22, after VLD
24 decodes macroblock header and macroblock data, as discussed in Figure 6 above.

In step S703B, VLIW 21 performs motion compensation and inverse discrete
cosine transforms (IDCT) for MB (i-1) using well known MPEG techniques. It is
noteworthy that step S703B occurs in parallel with S703A, if in step S702 data is still
being loaded.

In step S704, VLIW 21 outputs decoded MB (i) IDCTs and motion
compensation data as raw video data to frame buffer 40, from data cache 22 via DS3
39 and buses 42 and 43, respectively.

The advantage of the foregoing steps is that VLIW 21 can perform parallel
processing in steps S703A and S703B. Loading reference data values into data cache
memory 22 for an upcoming macroblock motion compensation and reconstruction
operations can take considerable time. As shown above, during this downloading
process, VLIW 21 processor can perform motion compensation and/or IDCTs on the
DCTs of a previously decoded macroblock, and hence improve the overall decoding
process. Furthermore, three macroblocks of data are processed with only two memory
buffers.
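
The benefit of overlapping reference data loading with computation can be estimated with a simple timing model. The cycle counts used below are hypothetical, and the model assumes that the load for one macroblock fully overlaps the computation for the previous one; it is not a measurement of the disclosed decoder.

```python
def total_cycles(num_macroblocks, load_cycles, compute_cycles, overlapped):
    """Estimate cycles to process a run of macroblocks with or without overlap."""
    if num_macroblocks == 0:
        return 0
    if not overlapped:
        return num_macroblocks * (load_cycles + compute_cycles)
    # The first load and the last computation cannot be hidden; in between,
    # each stage costs whichever of the two overlapping activities is slower.
    middle = (num_macroblocks - 1) * max(load_cycles, compute_cycles)
    return load_cycles + middle + compute_cycles


sequential = total_cycles(4, load_cycles=100, compute_cycles=150, overlapped=False)
pipelined = total_cycles(4, load_cycles=100, compute_cycles=150, overlapped=True)
```

With these illustrative numbers the sequential estimate is 1000 cycles and the overlapped estimate is 700 cycles, and the gap widens as more macroblocks are processed.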

Data Transfer Descriptors 

Data transfer from and to the various memory buffers is accomplished by
using a set of descriptors. Numerous sets of data descriptors are used for transferring
data from one memory buffer to another in the foregoing decoding system. A set of
descriptors includes a source descriptor describing the data source and a destination
descriptor describing where and in what format the data is transferred.

A set of descriptors is used to transfer data from CM1 29 to data cache 22, and
another set is used to transfer data from data cache 22 to CM1 29. Another set of
descriptors is used to transfer data from data cache 22 to [...]. Two other
sets of descriptors are used to transfer data from data cache 22 to frame buffer 40
and from frame buffer 40 to data cache 22. An example of source and destination
descriptors is provided below. It is noteworthy that the examples below illustrate
data descriptors and do not limit the present invention. Other data descriptor
formats may be used to implement the various aspects of the present invention.

Data from CM1 29 memory buffers B0 and B1 is transferred by using a Source
Descriptor Set ("SDS") that includes descriptors 1 and 2. Descriptor 1 includes
instructions to read from a CM1 29 buffer, e.g., B0, using a mode, e.g., non-coherent,
and having a width, e.g., 832 bytes. Descriptor 2 has instructions to read from a
buffer, e.g., B0' in cache memory 22, using a mode, e.g., coherent allocate, with a
width of 64 bytes and a pitch of -64 bytes, and a "halt after transfer" control
instruction. The -64 byte pitch means that the buffer is read repeatedly, 13 times
(13 x 64 = 832), to equal the 832 bytes needed to zero out the CM1 29 memory buffer.
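
The effect of a 64 byte width combined with a -64 byte pitch can be illustrated with
a short Python sketch. The pitch semantics used here (the read offset advances by
width plus pitch after each row) and the assumption that the 64 byte source row holds
zeros are illustrative assumptions, and the names SourceDescriptor and gather are
hypothetical rather than taken from the described hardware.

```python
from dataclasses import dataclass

@dataclass
class SourceDescriptor:
    base: int    # start offset of the source buffer
    width: int   # bytes read per row
    pitch: int   # bytes added to the offset after each row (assumed semantics)

def gather(memory: bytes, src: SourceDescriptor, total: int) -> bytes:
    """Read rows of src.width bytes until `total` bytes have been collected."""
    out = bytearray()
    offset = src.base
    while len(out) < total:
        out += memory[offset:offset + src.width]
        offset += src.width + src.pitch   # width 64 with pitch -64 re-reads the same row
    return bytes(out[:total])

zero_row = bytes(64)                      # 64 byte source row, assumed to hold zeros
fill = gather(zero_row, SourceDescriptor(base=0, width=64, pitch=-64), 832)
rows_read = 832 // 64                     # 13 repetitions of the 64 byte row
```

Under these assumptions one small source row is enough to produce the full 832 byte
transfer, which is why the descriptor needs no 832 byte zero buffer of its own.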

Each data transfer also has a Destination Descriptor Set ("DDS"). The DDS for
data transfer from CM1 29 includes instructions to write to a destination buffer, e.g.,
B0' in cache memory 22, in a particular mode, e.g., coherent allocate, with a width of
832 bytes and a control instruction, "no halt after transfer." The DDS for transfer of
data from data cache 22 includes instructions to write to a buffer, e.g., B0 in CM1 29,
in a mode, e.g., non-coherent, with a width of 832 bytes, and a control instruction,
e.g., "no halt after transfer." DDSs from CM1 29 designate buffers MB_B0', MB_B1' and
MB_B2' in data cache 22 sequentially. Also, DDSs from data cache 22 designate
CM1 29 memory buffers B0 and B1 sequentially.
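
The sequential designation of buffers can be pictured as two independent rotations,
one over the two CM1 29 buffers and one over the three data cache 22 buffers. The
Python sketch below is illustrative only; the function name transfer_pairs is
hypothetical, and pairing the two rotations in lockstep is an assumption based on the
order of the DS1 31 transfers described for Figures 8A-8D.

```python
from itertools import cycle, islice

CM1_BUFFERS = ["B0", "B1"]
CACHE_BUFFERS = ["MB_B0'", "MB_B1'", "MB_B2'"]

def transfer_pairs(n):
    """Return (source, destination) buffers of the first n CM1-to-cache transfers."""
    return list(islice(zip(cycle(CM1_BUFFERS), cycle(CACHE_BUFFERS)), n))
```

With two source buffers and three destination buffers, the same pairing recurs every
six transfers.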

Task Synchronization Loops: 

Figures 8A-8L show process steps for the computer programmed operation of
the decoder according to yet another aspect of the present invention, with groups of
operations being performed simultaneously. Efficient scheduling in processing
macroblock data is essential to optimize VLIW 21 and VLD 24 usage.

Various VLIW 21 processes and DMA transfers are incorporated in one trace,
i.e., motion vector reconstruction, motion compensation and IDCTs are performed
continually with ongoing transfers without semaphore waits. A trace is a sequence of
operations that are scheduled together. Traces are limited by module boundary
(entry/return), loop boundary and previously scheduled code. Furthermore, all VLIW
21 execution components, motion compensation transfers, VLD 24 DMA transfers and
output buffer transfers overlap to achieve maximum parallelism.

For illustration purposes, the Figure 8A-8L process steps show decoding and
DMA transfers for macroblocks designated as MB0, MB1, MB2, MB3, MB4, MB5,
MB6 and MB7. This illustration does not limit the invention and only shows how
different components operate within a continuous time loop to achieve optimum
efficiency. The Figure 8A-8L process steps also show how decoded raw video data can
be transferred to frame buffer 40 while other VLIW 21 and VLD 24 process steps are
being performed.

Figure 8A

In step S800A, VLIW 21 parses MPEG video bitstream 14 at a picture and
slice level. VLIW 21 also sets up DS0 27 for transferring bitstream 14 to VLD 24 via
Getbits engine 28. In parallel, in step S800B, VLIW 21 sends a "GO" command to
VLD 24 after VLD 24 has been initialized. Thereafter, in step S800C, slice processing
begins, and in step S800D, VLIW 21 sends the slice code for a given slice to VLD 24
by writing to CM1 29.

In step S801A, VLD 24 receives the slice code and decodes the MB0 header, and
saves the decoded header in CM1 29 memory buffer B0.

In step S801B, VLD 24 waits for DS1 31 to be ready for data transfer, and for
a "GO" command from VLIW 21. VLD 24 also sends a "continue" command to DS1
31 to transfer CM1 29 memory buffer B0 data (i.e., the decoded header of MB0) with
dummy coefficients to data cache 22 memory buffer MB_B0'.

In step S802A, DS1 31 transfers decoded MB0 header data from CM1 29
memory buffer B0 to data cache 22 memory buffer MB_B0', and in parallel, in step
S802B, VLD 24 decodes the DCT coefficients of MB0 and the header for MB1, and
saves the decoded data in CM1 29 memory buffer B1.

It is noteworthy that the DS1 31 data transfer and the VLD 24 decoding of MB0 DCT
coefficients and MB1 header occur simultaneously, and hence improve efficiency.

Figure 8B 

In step S803A, VLIW 21 sends a "GO" command to VLD 24 to proceed with
the next macroblock, and VLIW 21 also waits for the DS1 31 transfer of step S802A. In
parallel, in step S803B, VLD 24 waits for DS1 31 to finish the transfer of data from
memory buffer B0 (in step S802A) and waits for a "GO" command from VLIW 21.
VLD 24 also sends a "continue" command to DS1 31 to start the transfer of the decoded
DCT coefficients of MB0 and the decoded header of MB1 from CM1 29 memory buffer
B1 to data cache 22 memory buffer MB_B1', after the data transfer in step S802A.

In step S803C, VLIW 21 reconstructs the motion vector based upon the decoded MB0
header data stored in data cache 22 memory buffer MB_B0'. VLIW 21 also sets up
descriptors for the transfer of reference data from frame buffer 40 to data cache 22
for motion compensation of MB0.

In step S803D, DS1 31 transfers the data stored in CM1 29 memory buffer B1
(i.e., the decoded DCT coefficients of MB0 and the decoded header of MB1) to data
cache 22 memory buffer MB_B1'.

In step S803E, after receiving the "GO" command from VLIW 21, VLD 24
decodes the DCT coefficients of MB1 and the header for MB2, and saves the decoded
DCT coefficients of MB1 and the header of MB2 in CM1 29 memory buffer B0. It is
noteworthy that process steps S803C-S803E occur simultaneously, and while data is
being transferred from CM1 29 buffer B1 in step S803D, VLD 24 decodes the DCT
coefficients and header of the next macroblock. Hence the process steps for decoding,
data transfer and storage of decoded data are synchronized to minimize VLD 24 idle
time.

Figure 8C: 

Steps S804A-S804F show various operations performed by VLD 24 and VLIW 21
simultaneously, while various DMA channels transfer data. The various process steps
discussed below are synchronized to minimize time delay.

In step S804A, VLIW 21 waits for DS1 31 to transfer data (in step S803D),
and sends a "GO" command to VLD 24 to proceed with the next macroblock. VLIW 21
also sends a "continue" command to DS2 38 to transfer reference data from frame
buffer 40 to data cache 22 memory buffer MC_B0'.

In step S804B, parallel to step S804A, VLD 24 waits for the DS1 31 transfer of step
S803D, and for a "GO" command from VLIW 21. VLD 24 also sends a "continue"
command to DS1 31 to transfer CM1 29 memory buffer B0 data (i.e., the decoded DCT
coefficients for MB1 and the decoded header for MB2) to data cache 22 memory buffer
MB_B2'.

In step S804C, VLIW 21 reconstructs the motion vector for MB1 based upon the
decoded MB1 header data stored in data cache 22 memory buffer MB_B1'. VLIW 21
also sets up the descriptor set for DS2 38 to transfer reference data for motion
compensation of MB1.



WO 01/52539    PCT/US00/35287



In step S804D, in response to the "continue" command from VLIW 21, DS2
38 transfers reference data for MB0 from frame buffer 40 to data cache 22 memory
buffer MC_B0'.

In step S804E, DS1 31 transfers data (the decoded DCT coefficients for MB1 and
the header for MB2) from CM1 29 memory buffer B0 to data cache 22 memory buffer
MB_B2'.

In step S804F, VLD 24 decodes the DCT coefficients for MB2 and the header for
MB3, and stores the decoded DCT coefficients and decoded header in CM1 29
memory buffer B1.

It is noteworthy that process steps S804C to S804F occur in parallel, and hence
improve the overall efficiency of the decoding process.

Figure 8D 

In step S805A, VLIW 21 waits for the DS1 31 data transfer of step S804E, and
sends a "GO" command to VLD 24 to proceed with the next macroblock. VLIW 21
also waits for the DS2 38 transfer of reference data for MB0 in step S804D, and also
sends a "continue" command for transfer of reference data for MB1.

Parallel to step S805A, in step S805B, VLD 24 waits for the DS1 31 data transfer
of step S804E, and for a "GO" command from VLIW 21 to proceed with the next
macroblock. VLD 24 also sends a "continue" command to DS1 31 to transfer data
from CM1 29 memory buffer B1 after step S804E.

In step S805C, VLIW 21 reconstructs the motion vector for MB2 based upon
decoded data stored in data cache 22 memory buffer MB_B2', and sets up
descriptors for DS2 38 to transfer reference data for MB1 motion compensation.
Thereafter, VLIW 21 performs motion compensation for MB0 based upon reference
data stored in data cache 22 memory buffer MC_B0', and performs IDCTs for MB0
based upon decoded DCT coefficients stored in MB_B1'. Thereafter, VLIW 21 adds
the IDCTs and motion compensation data, and saves the MB0 IDCTs and motion
compensation data in data cache 22 output buffer Out_B0'.

In step S805D, DS2 38 loads reference data for MB1 to data cache 22 memory
buffer MC_B1'.

In step S805E, DS1 31 transfers the decoded DCT coefficients of MB2 and the
decoded header of MB3 from CM1 29 memory buffer B1 to data cache 22 memory
buffer MB_B0'.

In step S805F, after receiving the "GO" command from VLIW 21, VLD 24
decodes the DCT coefficients for MB3 and the header for MB4, and stores the decoded
DCT coefficients and decoded header in CM1 29 memory buffer B0.

It is noteworthy that steps S805C-S805F occur simultaneously and improve
parallelism between VLD 24 and VLIW 21 while efficiently transferring data using
DMA channels DS1 31 and DS2 38.

Figure 8E

In step S806A, VLIW 21 sends a "GO" command to VLD 24, and waits for the
DS1 31 transfer of step S805E. VLIW 21 also sends a "continue" command to DS3_0
39 to transfer decoded MB0 data from data cache 22 output buffer Out_B0' to
SDRAM frame buffer 40, and to DS2 38 to load reference data for MB2 from
SDRAM frame buffer 40 to data cache 22.

Parallel to step S806A, in step S806B, VLD 24 waits for the DS1 31 transfer of
step S805E, and waits for a "GO" command from VLIW 21. VLD 24 also sends a
"continue" command to CM1 29 memory buffer B1, to transfer data after step S805E.

In step S806C, VLIW 21 reconstructs the motion vector for MB3 based upon
decoded MB3 data stored in data cache 22 memory buffer MB_B0', and sets up
descriptors for DS2 38 to load MB3 reference data. Thereafter, VLIW 21 performs
motion compensation and IDCTs for MB1 based upon reference data stored in
MC_B1' and DCT coefficients stored in data cache 22 memory buffer MB_B2',
respectively. VLIW 21 also adds the IDCTs and motion compensation data for MB1, and
saves the added data in data cache 22 output memory buffer Out_B1'.

In step S806D, DS2 38 transfers reference data for MB2 from frame buffer 40
to data cache 22 memory buffer MC_B0'.

In step S806E, DS3_0 39 transfers MB0 decoded pixels from data cache 22
output buffer Out_B0' to frame buffer 40.

In step S806F, DS1 31 transfers the decoded header for MB4 and the DCT
coefficients for MB3 from CM1 29 memory buffer B0 to data cache 22 memory
buffer MB_B1'.

In step S806G, VLD 24 decodes the MB4 DCT coefficients and the header for MB5,
and thereafter saves the decoded data in CM1 29 memory buffer B1.

It is noteworthy that steps S806C-S806G occur simultaneously and hence
improve the VLIW pipeline as well as parallelism between VLD 24 and VLIW 21, while
efficiently transferring data using various DMA data transfer channels.
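
The steady-state pattern visible in Figures 8D-8F, where each unit works on a
different macroblock at the same time, can be summarized by a small Python sketch.
The offsets below are read from the steps of those figures; the function name stage is
hypothetical, and the sketch ignores the start-up steps and the end-of-slice handling.

```python
def stage(n):
    """Work performed in parallel while the VLIW compensates macroblock n."""
    work = {
        "VLIW": f"motion vector MB{n + 2}; motion compensation and IDCT MB{n}",
        "DS2": f"load reference data MB{n + 1}",
        "DS1": f"transfer DCT MB{n + 2} and header MB{n + 3}",
        "VLD": f"decode DCT MB{n + 3} and header MB{n + 4}",
    }
    if n >= 1:
        # Pixels of the previously compensated macroblock are written out by DS3.
        work["DS3"] = f"output pixels MB{n - 1}"
    return work
```

For example, stage(1) corresponds to steps S806C-S806G: the VLIW compensates MB1
while the VLD is already decoding the DCT coefficients of MB4.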

Figure 8F

Figure 8F shows that in step S807A, DS3_0 39 continues to transfer (from
Figure 8E) decoded pixel data of MB0 from data cache 22 output memory buffer
Out_B0' to frame buffer 40, while other VLD 24 and VLIW 21 operations are being
performed.

In step S807B, VLIW 21 waits for DS1 31 to finish the data transfer in step S806D,
and sends a "GO" command to VLD 24. VLIW 21 also waits for DS2 38 to transfer
reference data for MB2 in step S806D, and for the data transfer by DS3_0 39 in step
S807A. VLIW 21 also sends a "continue" command to DS2 38 (for transfer of reference
data for MB3) and to DS3_1 39 for transfer of decoded data from data cache 22 output
memory buffer Out_B1' after step S807A.

In step S807C, VLD 24 waits for the DS1 31 data transfer in step S806F, and waits
for a "GO" command from VLIW 21 to proceed with the next macroblock. VLD 24
sends a "continue" command to DS1 31 to transfer data from CM1 29 memory buffer
B0 after the data transfer from memory buffer B1 in step S806F.

It is noteworthy that steps S807A-S807C occur simultaneously.

In step S807D, VLIW 21 reconstructs the motion vector for MB4 based upon
decoded MB4 data stored in data cache 22 memory buffer MB_B1', and sets up
descriptors for DS2 38 to transfer reference data for MB4. VLIW 21 also performs
motion compensation for MB2 based upon reference data stored in data cache 22
memory buffer MC_B0', and also performs IDCTs for MB2 based upon decoded
DCT coefficients stored in data cache 22 memory buffer MB_B0'. VLIW 21 adds the
IDCTs and motion compensation results and saves the added data in data cache 22
output memory buffer Out_B2'.

In step S807E, DS2 38 transfers reference data for MB3 from frame buffer 40
to data cache 22 memory buffer MC_B1'.

In step S807F, DS3_1 39 transfers decoded pixels for MB1 from data cache 22
output memory buffer Out_B1' to frame buffer 40.

In step S807G, DS1 31 transfers the decoded header for MB5 and the decoded DCT
coefficients for MB4 from CM1 29 memory buffer B1 to data cache 22 memory
buffer MB_B2'.

In step S807H, after receiving a "GO" command from VLIW 21, VLD 24
decodes the DCT coefficients for MB5, and decodes the header for MB6. VLD 24 saves
the decoded MB5 DCT coefficients and MB6 header in CM1 29 memory buffer B0.

It is noteworthy that steps S807A and S807D-S807H occur in parallel.

Figure 8G 

In step S808A, DS3_1 39 continues to transfer decoded MB1 pixels.

In step S808B, VLIW 21 waits for the DS1 31 data transfer of step S807G, and
also sends a "GO" command to VLD 24 to proceed with the next macroblock. VLIW
21 also waits for the DS2 38 transfer in step S808E, and sends a "continue" command to
DS2 38 to transfer reference data for MB4. VLIW 21 also waits for DS3_0 to output
data to frame buffer 40 in step S807A, and sends a "continue" command to DS3_2 39
to transfer MB2 decoded pixel data from data cache 22 memory buffer Out_B2' to
frame buffer 40.

In step S808C, VLD 24 waits for the DS1 31 transfer of step S807G, and for a "GO"
command from VLIW 21 to proceed with the next macroblock. VLD 24 also sends a
"continue" command to DS1 31 to transfer data from CM1 29 memory buffer B0 after
step S807G.

In step S808D, VLIW 21 reconstructs the motion vector for MB5 from data stored
in data cache 22 memory buffer MB_B2', and sets up descriptors for DS2 38 to
transfer reference data for MB5. VLIW 21 performs motion compensation and IDCTs
for MB3 based upon reference data stored in MC_B1' and decoded DCT coefficients
stored in data cache 22 memory buffer MB_B1', respectively. Thereafter, VLIW 21
adds the IDCTs and motion compensation data, and saves the data in data cache 22
output memory buffer Out_B0'.

In step S808E, DS2 38 transfers reference data for MB4 from frame buffer 40
to data cache 22 memory buffer MC_B0'.

In step S808F, DS3_2 39 starts the transfer of decoded pixels for MB2 to frame
buffer 40. It is noteworthy that the data transfers in steps S807A, S808A and S808F
occur simultaneously. Hence the three paths of DS3 39, i.e., DS3_0, DS3_1 and DS3_2,
can simultaneously transfer the decoded MPEG video stream to frame buffer 40.

In step S808G, DS1 31 transfers the decoded header for MB6 and the DCT
coefficients for MB5 from CM1 29 memory buffer B0 to data cache 22 memory
buffer MB_B0'.

In step S808H, after receiving the "GO" command from VLIW 21, VLD 24
decodes the header for MB7 and the DCT coefficients for MB6, and stores the decoded
data in CM1 29 memory buffer B1.

It is noteworthy that process steps S808A, S808B and S808C occur
simultaneously. Also, steps S808A and S808C-S808H occur simultaneously.

Figure 8H 

In step S809A, DS3_2 39 continues to transfer decoded MB2 pixels from data
cache 22 output buffer Out_B2'.

In step S809B, VLIW 21 waits for the DS1 31 data transfer of step S808G, and
also sends a "GO" command to VLD 24 to proceed with the next macroblock. VLIW
21 also waits for the DS2 38 transfer in step S808E, and sends a "continue" command to
DS2 38 to transfer reference data for MB5. VLIW 21 also waits for DS3_0 to output
data to frame buffer 40 in step S807A, and sends a "continue" command to DS3_0 39
to transfer MB3 decoded pixel data from data cache 22 memory buffer Out_B0' to
frame buffer 40.

In step S809C, VLD 24 waits for the DS1 31 transfer of step S808G, and for a "GO"
command from VLIW 21 to proceed with the next macroblock. VLD 24 also sends a
"continue" command to DS1 31 to transfer data from CM1 29 memory buffer B1 after
step S808G.

In step S809D, VLIW 21 reconstructs the motion vector for MB6 from data stored
in data cache 22 memory buffer MB_B0', and sets up descriptors for DS2 38 to
transfer reference data for MB6. VLIW 21 performs motion compensation and IDCTs
for MB4 based upon reference data stored in MC_B0' and decoded DCT coefficients
stored in data cache 22 memory buffer MB_B2', respectively. Thereafter, VLIW 21
adds the IDCTs and motion compensation data, and saves the data in data cache 22
output memory buffer Out_B1'.

In step S809E, DS2 38 transfers reference data for MB5 from frame buffer 40
to data cache 22 memory buffer MC_B1'.

In step S809F, DS3_0 39 starts the transfer of decoded pixels for MB3 to frame
buffer 40.

In step S809G, DS1 31 transfers the decoded header for MB7 and the DCT
coefficients for MB6 from CM1 29 memory buffer B1 to data cache 22 memory buffer
MB_B1'.

In step S809H, VLD 24 decodes the DCT coefficients for MB7, and stores the
decoded DCT coefficients and a dummy header in CM1 29 memory buffer B1. VLD
24 performs this operation if macroblock MB7 is the last macroblock in the slice. The
dummy header may have a flag that indicates the end of a slice. Thereafter, VLD 24
finds a particular start code based upon the start code sent by VLIW 21.

It is noteworthy that process steps S809A-S809C occur simultaneously. Also,
process steps S809D-S809H occur simultaneously.

Figure 8I

In step S810A, DS3_0 39 continues to transfer decoded MB3 pixels from
output buffer Out_B0'.

In step S810B, VLIW 21 waits for the DS1 31 data transfer of step S809G, and also
sends a "GO" command to VLD 24. VLIW 21 also waits for the DS2 38 transfer in step
S809E, and sends a "continue" command to DS2 38 to transfer reference data for
MB6. VLIW 21 also waits for DS3_2 to output data to frame buffer 40 in step
S809A, and sends a "continue" command to DS3_1 39 to transfer MB4 decoded pixel
data from data cache 22 memory buffer Out_B1' to frame buffer 40.

In step S810C, VLD 24 waits for the DS1 31 transfer of step S809G, and for a "GO"
command from VLIW 21 to proceed with the next macroblock. VLD 24 also sends a
"continue" command to DS1 31 to transfer data from CM1 29 memory buffer B0 after
step S809G.

In step S810D, VLIW 21 reconstructs the motion vector for MB7 from data stored
in data cache 22 memory buffer MB_B1', and sets up descriptors for DS2 38 to transfer
reference data for MB7. VLIW 21 also performs motion compensation and IDCTs for
MB5 based upon reference data stored in MC_B1' and decoded DCT coefficients
stored in data cache 22 memory buffer MB_B0', respectively. Thereafter, VLIW 21
adds the IDCTs and motion compensation data, and saves the added data in data cache
22 output memory buffer Out_B2'.

In step S810E, DS2 38 transfers reference data for MB6 from frame buffer 40
to data cache 22 memory buffer MC_B0'.

In step S810F, DS3_1 39 starts the transfer of decoded pixels for MB4 to frame
buffer 40.

In step S810G, DS1 31 transfers a dummy header and the DCT coefficients for
MB7 from CM1 29 memory buffer B0 to data cache 22 memory buffer MB_B2'.

It is noteworthy that process steps S810A-S810C occur simultaneously. Also,
process steps S810A and S810D-S810G occur simultaneously.

Figure 8J 

In step S811A, DS3_1 39 continues to transfer decoded MB4 pixels from
output buffer Out_B1'.

In step S811B, VLIW 21 waits for the DS1 31 data transfer of step S810G, and also
sends a "GO" command to VLD 24 to proceed with the slice or picture. VLIW 21
also waits for the DS2 38 transfer in step S810E, and sends a "continue" command to
DS2 38 to transfer reference data for MB7. VLIW 21 also waits for DS3_0 to output
data to frame buffer 40 in step S810A, and sends a "continue" command to DS3_2 39 to
transfer MB5 decoded pixel data from data cache 22 memory buffer Out_B2' to frame
buffer 40.

In step S811C, VLIW 21 recognizes MB7 as the last macroblock. VLIW 21
performs motion compensation and IDCTs for MB6 based upon reference data stored
in MC_B0' and decoded DCT coefficients stored in data cache 22 memory buffer
MB_B1', respectively. Thereafter, VLIW 21 adds the IDCTs and motion
compensation data, and saves the data in data cache 22 output memory buffer
Out_B0'.

In step S811D, DS2 38 transfers reference data for MB7 from frame buffer 40
to data cache 22 memory buffer MC_B1'.

In step S811E, DS3_2 39 starts the transfer of decoded pixels for MB5 to frame
buffer 40.

It is noteworthy that process steps S811A and S811B, as well as steps
S811C-S811E, occur simultaneously.

Figure 8K 

In step S812A, DS3_2 39 continues to transfer decoded MB5 pixels from
output buffer Out_B2' to frame buffer 40.

In step S812B, VLIW 21 waits for the DS2 38 data transfer in step S811C. VLIW
21 also waits for DS3_1 to output data to frame buffer 40 in step S811A, and sends
a "continue" command to DS3_0 39 to transfer MB6 decoded pixel data from data
cache 22 memory buffer Out_B0' to frame buffer 40.

In step S812C, VLIW 21 performs motion compensation and IDCTs for MB7
based upon reference data stored in MC_B1' and decoded DCT coefficients stored in
data cache 22 memory buffer MB_B2', respectively. Thereafter, VLIW 21 adds the
IDCTs and motion compensation data, and saves the added data in data cache 22
output memory buffer Out_B1'.

In step S812D, DS3_0 39 starts the transfer of decoded pixels for MB6 to frame
buffer 40.

It is noteworthy that process steps S812A and S812B, as well as steps S812A
and S812C-S812D, occur simultaneously.

Figure 8L

In step S813A, VLIW 21 sends a "continue" command to DS3_1 39 to transfer
data for MB7. VLIW 21 also checks for the start code of the next slice/picture. If the
start code is not fake, then in step S813B, the process moves back to step S801A in
Figure 8A.

In step S813C, if the next slice code is a fake slice code, then VLIW 21 waits for
the DS3_0, DS3_1 and DS3_2 39 transfers to finish.

In step S813D, DS3_1 transfers the decoded data of MB7 to frame buffer 40 from
data cache 22 output buffer Out_B1'.

In step S813E, the process goes to the next picture, and the process steps of
Figures 8A-8L are repeated for the next picture.
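
The end-of-picture check of steps S813A-S813E can be sketched in Python as follows.
This is an illustration under stated assumptions: FAKE_SLICE is a hypothetical marker
(the actual value of the fake slice start code is not given here), and decode_slice and
wait_for_output are hypothetical callbacks standing in for the slice loop of Figures
8A-8K and for the wait on the DS3 39 paths.

```python
FAKE_SLICE = "FAKE"   # hypothetical marker; the real start code value is not specified

def decode_picture(slice_codes, decode_slice, wait_for_output):
    """Decode slices until the fake slice code appended after the picture is seen."""
    decoded = 0
    for code in slice_codes:
        if code == FAKE_SLICE:     # as in S813C: the end of the picture was reached
            wait_for_output()      # let the DS3_0, DS3_1 and DS3_2 transfers finish
            break
        decode_slice(code)         # as in S813B: a real slice, back to S801A
        decoded += 1
    return decoded
```

Any slice code that follows the fake one belongs to the next picture and is left
untouched by this call.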

The process steps of Figure 8 illustrate a timing loop that synchronizes data
decoding, data storage and data transfer by VLD 24, VLIW 21 and various DMA
channels, e.g., DS1 31, DS2 38 and DS3 39. The Figure 8 process steps illustrate
simultaneous data transfer of decoded MPEG video for three macroblocks, MB0, MB1
and MB2, based upon the three paths in DS3 39, namely DS3_0, DS3_1 and DS3_2.
This merely illustrates one aspect of the invention; other DMA transfer units with
more or fewer than three channels may be used to transfer raw video data.
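
The buffer usage in Figures 8A-8L follows a regular rotation, which the Python sketch
below summarizes. The modulo rules are inferred from the individual steps (for
example, MB4 uses MC_B0', MB_B2', Out_B1' and DS3_1 in steps S809D and S810F) and
are an illustrative reading, not a formula stated in the description; the function
name buffers_for is hypothetical.

```python
def buffers_for(n):
    """Buffers and DS3 path used for macroblock n, as read from Figures 8A-8L."""
    return {
        "header": f"MB_B{n % 3}'",        # data cache buffer holding the header of MB n
        "dct": f"MB_B{(n + 1) % 3}'",     # data cache buffer holding the DCTs of MB n
        "reference": f"MC_B{n % 2}'",     # reference data for motion compensation
        "output": f"Out_B{n % 3}'",       # reconstructed pixels of MB n
        "ds3_path": f"DS3_{n % 3}",       # DMA path that writes the pixels out
    }
```

The header and DCT coefficients of one macroblock sit in different buffers because
each transfer carries the DCT coefficients of one macroblock together with the
header of the next.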

The present invention has numerous advantages over the existing art.
According to one aspect of the present invention, the decoding of an entire picture
with macroblock data that includes the header for a current macroblock and the DCT
coefficients of a previous macroblock assists in maintaining continuous pipelined
operation.

The foregoing aspects of the invention simplify the decoding and
reconstruction process because VLD 24 decodes a macroblock header for a current
macroblock MB(i) and stores the decoded header data with a macroblock already
decoded (MB(i-1)), and transfers the decoded header and macroblock data (DCT
coefficients) to data cache 22 for access by VLIW 21. This enables VLIW 21 to acquire
reference data for a macroblock prior to performing motion compensation and IDCTs,
e.g., when VLD 24 sends the macroblock DCT coefficients for MB2 and the header for
MB3, VLIW 21 can acquire reference data for MB3 prior to performing motion
compensation and IDCTs. This reduces idle time and improves decoding efficiency.

Furthermore, while data transfers occur via the various DMA channels, VLIW
21 and VLD 24 simultaneously perform various operations as discussed. This also
improves the overall efficiency of the process.

The present invention has been described in general terms to allow those
skilled in the art to understand and utilize the invention in relation to specific preferred
embodiments. It will be understood by those skilled in the art that the present
invention is not limited to the disclosed preferred embodiments, and may be modified
in a number of ways without departing from the spirit and substance of the invention
as described and claimed herein. For example, the VLIW 21 processor of the present
invention is believed to be the most convenient processor architecture for use with the
variable length decoder to achieve maximum parallelism and improve efficiency in
MPEG decoding. However, other processors of the RISC or CISC type architecture
may be optimized to be used as the VLIW discussed in this application.

The foregoing aspects of the present invention are not limited to MPEG-1 or
MPEG-2 decoding; MPEG-4 can also be decoded by the foregoing process steps.
Furthermore, the foregoing aspects of the present invention are not limited to MPEG.
The foregoing aspects of the present invention are applicable wherever there is a need
for efficient synchronization and data exchange between a processor and a
co-processor, or between portions of a processor, for purposes of maintaining
coherence, accuracy and parallelism.



In addition, currently the core processor 20 and co-processor 23a are on the
same integrated circuit chip. However, the foregoing aspects of the present invention
will be applicable to other integrated circuits even if the core processor and
co-processor are not both on the same chip.

Furthermore, the present invention can be implemented essentially in software.
This is possible because software can dynamically create and maintain virtual
buffering, implement variable length decoding as well as discrete cosine transforms,
and the like. Hence, the foregoing aspects of the present invention can be
implemented essentially in software running on a general-purpose programmable
microprocessor/computer and still retain the spirit and substance of the present
invention, as more fully expressed in the attached claims.

We claim:

1. A method for decoding and reconstructing an input MPEG video data stream,
comprising the steps of:

decoding the encoded video data of a first macroblock whose header has been
decoded, and decoding the header of a second macroblock; wherein each macroblock
includes a header for a current macroblock and encoded MPEG video data of a
previous macroblock whose header has been decoded; and

performing motion vector reconstruction on the decoded first macroblock
video data, based upon the decoded macroblock header data.

2. The method of Claim 1, further comprising the step of: 

modifying the first decoded macroblock video data by motion compensation
based upon reference data of a previously decoded macroblock.

3. The method of Claim 1, further comprising:

performing the preceding steps until an entire slice has been reconstructed into
raw video signal data.

4. The method of Claim 1, further comprising:

performing the preceding steps until an entire picture has been reconstructed
into raw video signal data.

5. The method of Claim 1, wherein the decoding is performed by a variable
length decoder (VLD) in a co-processor.

6. The method of Claim 1, wherein the modifying step is performed by a core 
processor having a very long instruction word (VLIW) processor. 



7. The method of Claim 6, further comprising:

transferring the decoded header of the second macroblock and decoded video
data of the first macroblock from the VLD to the core processor; and

pre-fetching from a memory buffer reference data utilized by the VLIW for
motion compensation.

8. The method of Claim 1, further comprising the steps of:

adding a fake slice start code and fake macroblock data at the end of each
picture in the input MPEG video data stream; and

utilizing the fake slice start code and fake macroblock data to skip to the next
picture.

9. The method of Claim 8, wherein the fake macroblock data indicates
an error to the VLD, stopping the decoding process until the core processor reinitiates
decoding of a selected slice.

10. The method of Claim 1, wherein the decoded macroblock data includes 
discrete cosine transform coefficients. 

11. The method of Claim 1, wherein the decoded macroblock data includes motion
vectors.

12. The method of Claim 10, wherein the modifying step includes performing
inverse discrete cosine transformations on decoded discrete cosine transform
coefficients for the decoded first macroblock.

13. The method of Claim 12, further comprising:

selecting a matching macroblock from another frame based upon the reference
data for motion compensation; and

adjusting the decoded first macroblock data according to the reference data for
motion compensation.

14. A video decoder adapted to reconstruct raw video signal data from an input
MPEG coded video data stream, comprising:

a core processor for parsing the input coded video data stream to identify start
code and end location for slices and pictures, and the core processor is adapted to
perform motion compensation on the encoded MPEG video data; and

a differential decoder in a co-processor adapted to decode a macroblock header
for the input MPEG video data for a second macroblock and decode the input MPEG
video data for a first macroblock whose header is already decoded, wherein each
macroblock includes a macroblock header for a current macroblock and encoded video
data of a previous macroblock whose header has already been decoded.

15. The apparatus of Claim 14, wherein the core processor is adapted to modify the
decoded video data of the first macroblock by motion compensation based upon
reference data of a previously decoded macroblock.

16. The apparatus of Claim 14, wherein the differential decoder is a variable length
decoder (VLD).

17. The apparatus of Claim 14, wherein the core processor includes a very long
instruction word (VLIW) processor.

18. The apparatus of Claim 14, further comprising of: 

a first control unit for transferring the decoded header of the second
macroblock and decoded video data of the first macroblock from the VLD to a data
cache in the core processor;

a second control unit for pre-fetching from a memory buffer reference data
utilized by the VLIW for motion compensation; and 

a third control unit for transferring decoded raw video data from the core
processor data cache to a memory storage device.

19. The apparatus of Claim 17, wherein the VLIW adds a fake slice start code and
fake macroblock data at the end of each picture in the input MPEG video stream, and
the VLD utilizes the fake slice start code and fake macroblock data to skip to the next
picture.

20. The apparatus of Claim 17, wherein the fake macroblock data indicates an error to
the VLD stopping the decoding process until the core processor reinitiates decoding of 
a selected slice. 

21. The apparatus of Claim 14, wherein the decoded macroblock data includes
discrete cosine transform coefficients.

22. The apparatus of Claim 14, wherein the decoded macroblock data includes
motion vectors.

23. The apparatus of Claim 17, wherein the VLIW performs inverse
discrete cosine transforms on decoded discrete cosine transform coefficients of the
first macroblock.

24. The apparatus of Claim 14, wherein the co-processor includes a memory buffer
for storing decoded header and decoded macroblock data. 

25. The apparatus of Claim 25, wherein the first controller transfers data from the
co-processor data memory buffer to the core processor data cache.

26. The apparatus of Claim 14, wherein the core processor data cache stores data 
received from the co-processor via the first controller and the reference data from the 
memory storage device via the second controller. 